Libraries
library(tidyverse)
library(anytime)
library(lubridate)
library(assertive)
library(fst)
library(broom)
library(plot3D)
library(magrittr)
library(infer)
Data
food_consumption <- readRDS("food_consumption.rds")
amir_deals <- readRDS("seller_1.rds")
world_happiness <- readRDS("world_happiness_sugar.rds")
Statistics is the study of how to collect, analyze, and draw conclusions from data. It’s a hugely valuable tool that you can use to bring the future into focus and infer the answer to tons of questions. For example, what is the likelihood of someone purchasing your product, how many calls will your support team receive, and how many jeans sizes should you manufacture to fit 95% of the population? In this course, you’ll use sales data to discover how to answer questions like these as you grow your statistical skills and learn how to calculate averages, use scatterplots to show the relationship between numeric values, and calculate correlation. You’ll also tackle probability, the backbone of statistical reasoning, and learn how to conduct a well-designed study to draw your own conclusions from data.
Summary statistics gives you the tools you need to boil down massive datasets to reveal the highlights. In this chapter, you’ll explore summary statistics including mean, median, and standard deviation, and learn how to accurately interpret them. You’ll also develop your critical thinking skills, allowing you to choose the best summary statistics for your data.
In this chapter, you’ll be working with the 2018 Food Carbon Footprint Index from nu3. The food_consumption dataset contains the kilograms of food consumed per person per year in each country, for each food category (consumption), as well as the carbon footprint of that food category (co2_emission), measured in kilograms of carbon dioxide (CO2) per person per year in each country.
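The summary table below appears without its code chunk. It was most likely produced by something along these lines (a reconstruction based on the output, not the original code): filter for Belgium and the USA, group by country, and summarize the mean and median consumption.
# Reconstructed (assumed) code for the summary below
food_consumption %>%
  filter(country %in% c("Belgium", "USA")) %>%
  group_by(country) %>%
  summarize(mean_consumption = mean(consumption),
            median_consumption = median(consumption))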
## # A tibble: 2 × 3
## country mean_consumption median_consumption
## <chr> <dbl> <dbl>
## 1 Belgium 42.1 12.6
## 2 USA 44.6 14.6
Quantiles are a great way of summarizing numerical data since they can be used to measure center and spread, as well as to get a sense of where a data point stands in relation to the rest of the dataset. For example, you might want to give a discount to the 10% most active users on a website.
# Calculate the quartiles of co2_emission
quantile(food_consumption$co2_emission)
## 0% 25% 50% 75% 100%
## 0.0000 5.2100 16.5300 62.5975 1712.0000
# Calculate the quintiles of co2_emission
quantile(food_consumption$co2_emission, probs = c(0, 0.2, 0.4, 0.6, 0.8, 1))
## 0% 20% 40% 60% 80% 100%
## 0.000 3.540 11.026 25.590 99.978 1712.000
# Calculate the deciles of co2_emission
quantile(food_consumption$co2_emission, probs = seq(0, 1, 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.000 0.668 3.540 7.040 11.026 16.530 25.590 44.271
## 80% 90% 100%
## 99.978 203.629 1712.000
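Tying this back to the "top 10% of users" idea: the cutoff for the top 10% is simply the 90th percentile, which you can request directly (a small illustration on the same column; the value matches the 90% decile above).
# Cutoff above which an observation is in the top 10% of co2_emission
quantile(food_consumption$co2_emission, probs = 0.9)  # roughly 203.6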
Variance and standard deviation are two of the most common ways to measure the spread of a variable, and you’ll practice calculating these in this exercise. Spread is important since it can help inform expectations. For example, if a salesperson sells a mean of 20 products a day, but has a standard deviation of 10 products, there will probably be days where they sell 40 products, but also days where they only sell one or two. Information like this is important, especially when making predictions.
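To put rough numbers on that intuition, here is a minimal sketch that assumes daily sales are approximately normal with mean 20 and standard deviation 10 (an assumption made only for this illustration):
# Illustration only: model daily sales as Normal(mean = 20, sd = 10)
pnorm(40, mean = 20, sd = 10, lower.tail = FALSE)  # P(selling more than 40) ~ 0.023
pnorm(2, mean = 20, sd = 10)                       # P(selling 2 or fewer)  ~ 0.036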
# Calculate variance and sd of co2_emission for each food_category
food_consumption %>%
  group_by(food_category) %>%
  summarize(var_co2 = var(co2_emission),
            sd_co2 = sd(co2_emission))
## # A tibble: 11 × 3
## food_category var_co2 sd_co2
## <fct> <dbl> <dbl>
## 1 beef 88748. 298.
## 2 eggs 21.4 4.62
## 3 fish 922. 30.4
## 4 lamb_goat 16476. 128.
## 5 dairy 17672. 133.
## 6 nuts 35.6 5.97
## 7 pork 3095. 55.6
## 8 poultry 245. 15.7
## 9 rice 2281. 47.8
## 10 soybeans 0.880 0.938
## 11 wheat 71.0 8.43
# Plot food_consumption with co2_emission on x-axis
ggplot(food_consumption, aes(co2_emission)) +
# Create a histogram
geom_histogram() +
# Create a separate sub-graph for each food_category
facet_wrap(~ food_category)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Outliers can have big effects on statistics like mean, as well as statistics that rely on the mean, such as variance and standard deviation. Interquartile range, or IQR, is another way of measuring spread that’s less influenced by outliers.
# Calculate total co2_emission per country: emissions_by_country
emissions_by_country <- food_consumption %>%
  group_by(country) %>%
  summarize(total_emission = sum(co2_emission))
emissions_by_country
## # A tibble: 130 × 2
## country total_emission
## <chr> <dbl>
## 1 Albania 1778.
## 2 Algeria 708.
## 3 Angola 413.
## 4 Argentina 2172.
## 5 Armenia 1110.
## 6 Australia 1939.
## 7 Austria 1211.
## 8 Bahamas 1193.
## 9 Bangladesh 374.
## 10 Barbados 889.
## # … with 120 more rows
# Compute the first and third quartiles and IQR of total_emission
q1 <- quantile(emissions_by_country$total_emission, 0.25)
q3 <- quantile(emissions_by_country$total_emission, 0.75)
iqr <- q3 - q1
# Calculate the lower and upper cutoffs for outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
# Filter emissions_by_country to find outliers
emissions_by_country %>% filter(total_emission < lower | total_emission > upper)
## # A tibble: 1 × 2
## country total_emission
## <chr> <dbl>
## 1 Argentina 2172.
In this chapter, you’ll learn how to generate random samples and measure chance using probability. You’ll work with real-world sales data to calculate the probability of a salesperson being successful. Finally, you’ll use the binomial distribution to model events with binary outcomes.
You’re in charge of the sales team, and it’s time for performance reviews, starting with Amir. As part of the review, you want to randomly select a few of the deals that he’s worked on over the past year so that you can look at them more deeply. Before you start selecting deals, you’ll first figure out what the chances are of selecting certain deals.
# Count the deals for each product
amir_deals %>% count(product)
## product n
## 1 Product A 23
## 2 Product B 62
## 3 Product C 15
## 4 Product D 40
## 5 Product E 5
## 6 Product F 11
## 7 Product G 2
## 8 Product H 8
## 9 Product I 7
## 10 Product J 2
## 11 Product N 3
# Calculate probability of picking a deal with each product
amir_deals %>% count(product) %>% mutate(prob = n / sum(n))
## product n prob
## 1 Product A 23 0.12921348
## 2 Product B 62 0.34831461
## 3 Product C 15 0.08426966
## 4 Product D 40 0.22471910
## 5 Product E 5 0.02808989
## 6 Product F 11 0.06179775
## 7 Product G 2 0.01123596
## 8 Product H 8 0.04494382
## 9 Product I 7 0.03932584
## 10 Product J 2 0.01123596
## 11 Product N 3 0.01685393
# Set random seed to 31
set.seed(31)
# Sample 5 deals without replacement
amir_deals %>% sample_n(5)
## product client status amount num_users
## 1 Product D Current Lost 3086.88 55
## 2 Product C Current Lost 3727.66 19
## 3 Product D Current Lost 4274.80 9
## 4 Product B Current Won 4965.08 9
## 5 Product A Current Won 5827.35 50
# Sample 5 deals with replacement
amir_deals %>% sample_n(5, replace = TRUE)
## product client status amount num_users
## 1 Product A Current Won 6010.04 24
## 2 Product B Current Lost 5701.70 53
## 3 Product D Current Won 6733.62 27
## 4 Product F Current Won 6780.85 80
## 5 Product C Current Won -539.23 11
A new restaurant opened a few months ago, and the restaurant’s management wants to optimize its seating space based on the size of the groups that come most often. On one night, there are 10 groups of people waiting to be seated at the restaurant, but instead of being called in the order they arrived, they will be called randomly. In this exercise, you’ll investigate the probability of groups of different sizes getting picked first. Data on each of the ten groups is contained in the restaurant_groups data frame.
group_id <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
group_size <- c(2, 4, 6, 2, 2, 2, 3, 2, 4, 2)
restaurant_groups <- data.frame(group_id, group_size)
# Create the probability distribution: count the number of each group size
size_distribution <- restaurant_groups %>% count(group_size) %>%
# Calculate probability
mutate(probability = n / sum(n))
size_distribution
## group_size n probability
## 1 2 6 0.6
## 2 3 1 0.1
## 3 4 2 0.2
## 4 6 1 0.1
# Calculate the probability of picking a group of 4 or more: filter for groups of 4 or larger
size_distribution %>% filter(group_size >= 4) %>%
# Calculate prob_4_or_more by taking sum of probabilities
summarize(prob_4_or_more = sum(probability))
## prob_4_or_more
## 1 0.3
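The same distribution can also be summarized by its expected (mean) group size, which isn't shown in the original output; a quick sketch using the size_distribution built above:
# Expected group size: probability-weighted sum of group_size
size_distribution %>%
  summarize(expected_group_size = sum(group_size * probability))
# 2*0.6 + 3*0.1 + 4*0.2 + 6*0.1 = 2.9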
The sales software used at your company is set to automatically back itself up, but no one knows exactly what time the back-ups happen. It is known, however, that back-ups happen exactly every 30 minutes. Amir comes back from sales meetings at random times to update the data on the client he just met with. He wants to know how long he’ll have to wait for his newly-entered data to get backed up. Use your new knowledge of continuous uniform distributions to model this situation and answer Amir’s questions.
# Min and max wait times for back-up that happens every 30 min
min <- 0
max <- 30
# Calculate probability of waiting less than 5 mins
prob_less_than_5 <- punif(5, min, max)
prob_less_than_5
## [1] 0.1666667
# Calculate probability of waiting greater than 5 mins
prob_greater_than_5 <- punif(5, min, max, lower.tail = FALSE)
prob_greater_than_5
## [1] 0.8333333
# Calculate probability of waiting 10-20 mins
prob_between_10_and_20 <- punif(20, min, max) - punif(10, min, max)
prob_between_10_and_20
## [1] 0.3333333
# Set random seed to 334
set.seed(334)
# Generate 1000 wait times between 0 and 30 mins, save in time column
wait_times <- data.frame(c(1:1000))
head(wait_times %>% mutate(time = runif(1000, 0, 30)))
## c.1.1000. time
## 1 1 29.4531546
## 2 2 16.2911178
## 3 3 0.3692466
## 4 4 24.1867736
## 5 5 23.4260131
## 6 6 15.9249217
# Generate 1000 wait times between 0 and 30 mins, save in time column
wait_times %>% mutate(time = runif(1000, min = 0, max = 30)) %>%
# Create a histogram of simulated times
ggplot(aes(time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Assume that Amir usually works on 3 deals per week, and overall, he wins 30% of the deals he works on. Each deal has a binary outcome: it’s either lost or won, so you can model his sales deals with a binomial distribution. In this exercise, you’ll help Amir simulate a year’s worth of his deals so he can better understand his performance.
# Set random seed to 10
set.seed(10)
# Simulate the outcome of a single deal (30% chance of winning)
rbinom(1, 1, 0.3)
## [1] 0
# Simulate 1 week of 3 deals
rbinom(1, 3, 0.3)
## [1] 0
# Simulate 52 weeks of 3 deals
deals <- rbinom(52, 3, 0.3)
# Calculate mean deals won per week
mean(deals)
## [1] 0.8076923
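As a cross-check (not part of the original output), the theoretical expected value of a binomial distribution is n * p, so 3 deals per week with a 30% win rate gives about 0.9 expected wins per week, close to the simulated 0.81 above:
# Theoretical expected wins per week for Binomial(n = 3, p = 0.3)
3 * 0.3  # 0.9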
# Probability of closing 3 out of 3 deals
dbinom(3, 3, 0.3)
## [1] 0.027
# Probability of closing <= 1 deal out of 3 deals
pbinom(1, 3, 0.3)
## [1] 0.784
# Probability of closing > 1 deal out of 3 deals
pbinom(1, 3, 0.3, lower.tail = FALSE)
## [1] 0.216
It’s time to explore one of the most important probability distributions in statistics, the normal distribution. You’ll create histograms to plot normal distributions and gain an understanding of the central limit theorem, before expanding your knowledge of statistical functions by adding the Poisson, exponential, and t-distributions to your repertoire.
Since each deal Amir worked on (both won and lost) was different, each was worth a different amount of money. These values are stored in the amount column of amir_deals. As part of Amir’s performance review, you want to be able to estimate the probability of him selling different amounts, but before you can do this, you’ll need to determine what kind of distribution the amount variable follows.
# Histogram of amount with 10 bins
ggplot(amir_deals, aes(amount)) + geom_histogram(bins = 10, fill = "green")
# What's the probability of Amir closing a deal worth less than $7500 (modeling amount as normal with mean 5000 and sd 2000)?
pnorm(7500, 5000, 2000)
## [1] 0.8943502
# What's the probability of Amir closing a deal worth more than $1000?
pnorm(1000, 5000, 2000, lower.tail = FALSE)
## [1] 0.9772499
# What amount will 75% of Amir's sales be more than?
qnorm(0.75, 5000, 2000, lower.tail = FALSE)
## [1] 3651.02
The central limit theorem states that the sampling distribution of a sample statistic (such as the sample mean) approaches a normal distribution as the sample size grows, no matter what distribution the data is originally drawn from; taking many repeated samples lets you see that sampling distribution take shape.
In this exercise, you’ll focus on the sample mean and see the central limit theorem in action while examining the num_users column of amir_deals more closely, which contains the number of people who intend to use the product Amir is selling.
# Create a histogram of num_users
ggplot(amir_deals, aes(num_users)) + geom_histogram(bins = 10)
# Set seed to 104
set.seed(104)
# Sample 20 num_users with replacement from amir_deals
sample(amir_deals$num_users, size = 20, replace = TRUE) %>%
# Take mean
mean()
## [1] 30.35
# Repeat the above 100 times
sample_means <- replicate(100, sample(amir_deals$num_users, size = 20, replace = TRUE) %>% mean())
# Create data frame for plotting
samples <- data.frame(mean = sample_means)
# Histogram of sample means
ggplot(samples, aes(mean)) + geom_histogram(bins = 10)
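As a quick check on the central limit theorem (not part of the original output), the mean of the sample means should land close to the mean of the underlying num_users column; the exact values depend on the seed:
# Compare the mean of the 100 sample means to the overall mean of num_users
mean(sample_means)
mean(amir_deals$num_users)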
Your company uses sales software to keep track of new sales leads. It organizes them into a queue so that anyone can follow up on one when they have a bit of free time. Since the number of lead responses is a countable outcome over a period of time, this scenario corresponds to a Poisson distribution. On average, Amir responds to 4 leads each day. In this exercise, you’ll calculate probabilities of Amir responding to different numbers of leads.
# What's the probability that Amir responds to 5 leads in a day, given that he responds to an average of 4?
dpois(5, 4)
## [1] 0.1562935
# What's the probability that Amir responds to 2 or fewer leads in a day?
ppois(2, 4)
## [1] 0.2381033
# What's the probability that Amir responds to more than 10 leads in a day?
ppois(10, 4, lower.tail = FALSE)
## [1] 0.002839766
To further evaluate Amir’s performance, you want to know how much time it takes him to respond to a lead after he opens it. On average, it takes 2.5 hours for him to respond. In this exercise, you’ll calculate probabilities of different amounts of time passing between Amir receiving a lead and sending a response.
# What's the probability it takes Amir less than an hour to respond to a lead? (exponential with rate = 1 / mean = 1 / 2.5)
pexp(1, 1/2.5)
## [1] 0.32968
# Probability response takes > 4 hours
pexp(4, 1/2.5, lower.tail = FALSE)
## [1] 0.2018965
# Probability response takes 3-4 hours
pexp(4, 1/2.5) - pexp(3, 1/2.5)
## [1] 0.09929769
In this chapter, you’ll learn how to quantify the strength of a linear relationship between two variables, and explore how confounding variables can affect the relationship between two other variables. You’ll also see how a study’s design can influence its results, change how the data should be analyzed, and potentially affect the reliability of your conclusions.
In this chapter, you’ll be working with a dataset world_happiness containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.
# Create a scatterplot of happiness_score vs. life_exp
ggplot(world_happiness, aes(life_exp, happiness_score)) + geom_point()
# Add a linear trendline to scatterplot
ggplot(world_happiness, aes(life_exp, happiness_score)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Correlation between life_exp and happiness_score
cor(world_happiness$life_exp, world_happiness$happiness_score)
## [1] 0.7737615
While the correlation coefficient is a convenient way to quantify the strength of a relationship between two variables, it’s far from perfect. In this exercise, you’ll explore one of the caveats of the correlation coefficient by examining the relationship between a country’s GDP per capita (gdp_per_cap) and life expectancy (life_exp).
# Scatterplot of gdp_per_cap and life_exp
ggplot(world_happiness, aes(gdp_per_cap, life_exp)) + geom_point()
# Correlation between gdp_per_cap and life_exp
cor(world_happiness$gdp_per_cap, world_happiness$life_exp)
## [1] 0.7235027
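The caveat being illustrated is that the correlation coefficient only measures linear association. A small self-contained sketch (not part of the original exercise, using made-up x values) shows how a perfect but non-linear relationship can still produce a correlation near zero:
# Correlation only captures linear relationships:
# y is completely determined by x, yet the correlation is essentially 0
x <- seq(-10, 10, by = 0.1)
y <- x^2
cor(x, y)  # approximately 0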
When variables have skewed distributions, they often require a transformation before they form a linear relationship with another variable, so that the correlation coefficient is a meaningful summary. In this exercise, you’ll perform a transformation yourself.
# Scatterplot of happiness_score vs. gdp_per_cap
ggplot(world_happiness, aes(gdp_per_cap, happiness_score)) + geom_point()
# Calculate correlation
cor(world_happiness$happiness_score, world_happiness$gdp_per_cap)
## [1] 0.7601853
# Create log_gdp_per_cap column
world_happiness <- world_happiness %>% mutate(log_gdp_per_cap = log(gdp_per_cap))
# Scatterplot of happiness_score vs. log_gdp_per_cap
ggplot(world_happiness, aes(log_gdp_per_cap, happiness_score)) + geom_point()
# Calculate correlation
cor(world_happiness$happiness_score,world_happiness$log_gdp_per_cap)
## [1] 0.7965484
# Scatterplot of grams_sugar_per_day and happiness_score
ggplot(world_happiness, aes(grams_sugar_per_day, happiness_score)) + geom_point()
# Correlation between grams_sugar_per_day and happiness_score
cor(world_happiness$happiness_score,world_happiness$grams_sugar_per_day)
## [1] 0.69391
The End.
Thanks DataCamp
- My Favorite Team -